ebGSEA 0.1.0
Gene Set Enrichment Analysis is one of the most common tasks in the analysis of omic data, and is critical for biological interpretation. In the context of Epigenome Wide Association Studies, which typically rank individual cytosines according to the level of differential methylation, enrichment analysis of biological pathways is challenging due to differences in CpG/probe density between genes. Here we propose an empirical Bayes Gene Set Enrichment Analysis (ebGSEA) algorithm, which does not rank CpGs but genes according to the overall level of differential methylation of its CpGs/probes, allowing unbiased and sensitive detection of enriched pathways. ebGSEA serves as a useful GSEA tool for EWAS that use Illumina HM450k and EPIC beadarrays. For details please refer to our publications listed at the end of this tutorial(Dong et al. 2019).
We used a HM450k buccal swab dataset(Teschendorff et al. 2015), which contains 104 buccal swab samples with smoking-pack-years as phenotype. The beta value matrix has been processed, with samples by column and probes by row.
load("/mnt/local-disk/data/zhutianyu/ebGSEA/trainBUC.rda")
dim(data.m)
## [1] 484272 104
Firstly, we do global test(Goeman et al. 2004) to rank genes according to overall differential methylation level of mapped CpGs. Here are the inputs for doGT function:
Because the function will take a few miniutes to run, we have stored the output of doGT in the pacakge. You can get the result by data("sgtm")
library(ebGSEA)
##
## sgt.m <-doGT(pheno.v,data.m,array="450k",ncores=20)
data("sgtm")
We can get the following matrix ordered by statistic for all genes tested:
dim(sgt.m)
## [1] 18618 5
head(sgt.m)
## p-value Statistic Expected Std.dev #Cov
## 1545 4.964735e-07 11.851194 0.9708738 0.7611331 38
## 100313886 1.049926e-04 11.044014 0.9708738 1.0936538 5
## 386757 6.694333e-04 10.774705 0.9708738 1.3663415 1
## 1543 2.860083e-05 10.163787 0.9708738 0.8705077 35
## 239 1.952693e-04 9.980009 0.9708738 1.0607971 24
## 3426 7.105079e-04 9.920837 0.9708738 1.2614460 5
Then we can apply doGSEAwt function to do pathway enrichment analysis with Wilcox test and Knwon-Population Median Test, here are the input parameters:
doGT function, with genes by row and ranked by statistics from global test. Rownames of the matrix should be gene EntrezID.data("MSigDB-28Feb14-data").data("MSigDB-28Feb14-data")
topGSEA.lm <- doGSEAwt(rankEID.m = sgt.m, ptw.ls = listEZ.lv, ncores = 10, minN = 5,adjPVth = 0.05)
## Running Wilcox Test and Known Population Median Test...
## Done
## 'select()' returned 1:1 mapping between keys and columns
The result is a list of three objects:
There are 2240 enriched pathway identified for this sample data.
summary(topGSEA.lm)
## Length Class Mode
## Rank(P) 11200 -none- numeric
## Rank(AUC) 11200 -none- numeric
## Genestat 2240 -none- list
In the first of two objects we can view the following information for each enriched pathway rank by adjusted wilcox test p value or AUC:
head(topGSEA.lm$`Rank(P)`)
## nREP AUC P(WT)
## CAGGTG_V$E12_Q6 2314 0.5931273 4.412916e-48
## DODD_NASOPHARYNGEAL_CARCINOMA_UP 1504 0.6089612 4.822036e-45
## TGANTCA_V$AP1_C 1059 0.6179386 1.946188e-38
## ONDER_CDH1_TARGETS_2_DN 441 0.6730179 8.322979e-36
## JAEGER_METASTASIS_DN 249 0.7272961 2.727236e-35
## MEISSNER_BRAIN_HCP_WITH_H3K4ME3_AND_H3K27ME3 1014 0.6057309 4.125484e-30
## P(KPMT) adjP
## CAGGTG_V$E12_Q6 6.763456e-34 3.780545e-44
## DODD_NASOPHARYNGEAL_CARCINOMA_UP 1.984042e-32 2.065519e-41
## TGANTCA_V$AP1_C 2.268952e-31 5.557665e-35
## ONDER_CDH1_TARGETS_2_DN 1.885097e-25 1.782574e-32
## JAEGER_METASTASIS_DN 7.821026e-30 4.672847e-32
## MEISSNER_BRAIN_HCP_WITH_H3K4ME3_AND_H3K27ME3 1.959189e-19 5.890503e-27
From the results we can see the biological term nasopharynx carcinoma is highly enriched. Nasopharynx carcinoma is a cancer that is smoking related and the nasopharynx is exposed to smoke carcinogens.
The third list Genestat consists of the statistic and p-value from global test, of each gene in a specific enriched pathway. The genes in the first pathway:
head(topGSEA.lm$Genestat[[1]])
## Pvalue Statistic
## MEF2C 0.03455165 3.0828456
## FAM126A 0.19456354 1.3593731
## SCN3B 0.60704648 0.7065586
## MAGI3 0.03416454 3.3049866
## RORC 0.01658052 4.5841419
## SYNCRIP 0.04887433 2.3875709
We can plot the AUC and adjP for each enriched pathway:
plot(x = topGSEA.lm$`Rank(P)`[,2], y = -log10(topGSEA.lm$`Rank(P)`[,5]), xlab ='AUC', ylab = '-log10(adjP)', main = 'AUC and adjP for each enriched pathway', pch = 21, bg = 'red')